Phrase Level Segmentation and Labelling of Machine Translation Errors
نویسندگان
چکیده
This paper presents our work towards a novel approach for Quality Estimation (QE) of machine translation based on sequences of adjacent words, the so-called phrases. This new level of QE aims to provide a natural balance between QE at word and sentence-level, which are either too fine grained or too coarse levels for some applications. However, phrase-level QE implies an intrinsic challenge: how to segment a machine translation into sequence of words (contiguous or not) that represent an error. We discuss three possible segmentation strategies to automatically extract erroneous phrases. We evaluate these strategies against annotations at phrase-level produced by humans, using a new dataset collected for this purpose.
منابع مشابه
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
متن کاملIntegrated Chinese word segmentation in statistical machine translation
A Chinese sentence is represented as a sequence of characters, and words are not separated from each other. In statistical machine translation, the conventional approach is to segment the Chinese character sequence into words during the pre-processing. The training and translation are performed afterwards. However, this method is not optimal for two reasons: 1. The segmentations may be erroneou...
متن کاملA Phrase Combination Approach to Patent SMT
This paper presents a phrase combination approach to patent SMT (Statistical Machine Translation) for Japanese to English. To minimize the segmentation problems caused by the rich OOV (out-ofvocabulary) words in the patent texts, the character based translation phrases are first introduced to avoid the segmentation errors in translation modeling. Then the word based translation phrases, which a...
متن کاملPatent SMT Based on Combined Phrases for NTCIR-7
In this paper, we describe a combined phrase approach to the Statistical Machine Translation of Japanese patents into English. To resolve the segmentation errors caused by the rich OOV (out-of-vocabulary) words in the patent texts, the character based translation phrases are first employed. Then the word based translation phrases are established to utilize the dependable word level information....
متن کاملIntegrating Dictionary and Web N-grams for Chinese Spell Checking
Chinese spell checking is an important component of many NLP applications, including word processors, search engines, and automatic essay rating. Nevertheless, compared to spell checkers for alphabetical languages (e.g., English or French), Chinese spell checkers are more difficult to develop because there are no word boundaries in the Chinese writing system and errors may be caused by various ...
متن کامل